Bound Ratis reconfiguration retries and add region migration ITs by CRZbulabula · Pull Request #17895 · apache/iotdb

CRZbulabula · 2026-06-10T02:08:12Z

Problem

When a DataNode that hosts a newly ADDING peer is killed during a Ratis region migration, the Ratis configuration change can stay in the staging state. The previous reconfiguration client retried these failures forever, so the DataNode add-peer task could remain PROCESSING indefinitely and the ConfigNode migration procedure could not fail and roll back.

Changes

Bound Ratis reconfiguration retries with a new reconfigurationMaxRetryAttempts client setting. Reconfiguration requests still use a fixed 2s retry interval, but now stop after the configured attempt count and propagate the last failure to the caller.
Renamed the reconfiguration retry client factory/policy from the old endless-retry naming to ReconfigurationRetryFactory and RatisReconfigurationRetryPolicy.
Added per-scope retry knobs:
- config_node_ratis_reconfiguration_max_retry_attempts
- schema_region_ratis_reconfiguration_max_retry_attempts
- data_region_ratis_reconfiguration_max_retry_attempts
Set the default retry count to 15, which is roughly 30 seconds with the fixed 2s interval. This keeps the timeout close to the default ConfigNode Unknown detection window while still leaving a small buffer for transient delays.
Propagated schema/data reconfiguration retry settings from ConfigNode to DataNodes through optional TRatisConfig fields. The fields are optional so rolling upgrades can fall back to local DataNode defaults when an old ConfigNode does not send them.
Wired the new retry setting through ConfigNode, DataNode, and the Ratis consensus config builders for ConfigNode, schema-region, and data-region consensus.
Added integration-test config setters for the new Ratis reconfiguration retry knobs.
Extended the region migration reliability IT framework with a failure-and-rollback path that waits for kill points, verifies the region replica set returns to the original DataNode set, and verifies the target peer is cleared.
Added a Ratis DataNode kill point during destination local peer creation so the ADDING-peer crash path can be reproduced.

Ratis IT Coverage

The branch adds Ratis region migration ITs to match the IoTV1 coverage pattern and cover Ratis-specific consensus-pipe states:

DataNode crash while creating the destination local peer, expecting migration failure and rollback.
ConfigNode crash at add-region states:
- CREATE_NEW_REGION_PEER
- CREATE_CONSENSUS_PIPES
- DO_ADD_REGION_PEER
- UPDATE_REGION_LOCATION_CACHE
ConfigNode crash at remove-region states:
- TRANSFER_REGION_LEADER
- REMOVE_REGION_PEER
- DELETE_OLD_REGION_PEER
- DROP_CONSENSUS_PIPES
- REMOVE_REGION_LOCATION_CACHE
ConfigNode crash with all add/remove migration kill points enabled in one migration.
Cluster crash at the IoTV1-covered migration states, plus Ratis CREATE_CONSENSUS_PIPES and DROP_CONSENSUS_PIPES.

The Ratis IT setup uses reconfiguration retry count 10 to keep failure-path tests bounded. The new Ratis IT classes are now tagged with DailyIT after passing CI.

Verification

mvn spotless:apply -pl integration-test -P with-integration-tests
mvn test-compile -pl integration-test -P with-integration-tests -DskipTests
mvn verify -DskipUTs -Dit.test=IoTDBRegionMigrateAddingPeerCrashForRatisIT,IoTDBRegionMigrateConfigNodeCrashForRatisIT,IoTDBRegionMigrateClusterCrashForRatisIT -DintegrationTest.includedGroups= -DintegrationTest.excludedGroups= -DfailIfNoTests=false -Dfailsafe.failIfNoSpecifiedTests=false -pl integration-test -P with-integration-tests
- 19 tests passed.
mvn spotless:check -pl integration-test -P with-integration-tests
git diff --check

When scaling in a DataNode, if a peer that is being ADDED to a Ratis schema/data region (during region migration) is killed before it catches up, the leader's reconfiguration can never commit. The Ratis client used for reconfiguration retried forever (retryForeverWithSleep), so the coordinator DataNode's AddRegionPeerTask blocked indefinitely inside setConfiguration, leaving the migration permanently stuck -- and CANCEL ineffective, since the task never left PROCESSING. Bound the reconfiguration retries instead of retrying forever: after the limit is exhausted the last failure propagates, so the upper layer can fail and roll back the migration (and CANCEL becomes reachable again). Expose the limit as a per-group config, pushed from ConfigNode to DataNode via TRatisConfig (optional fields, for rolling-upgrade safety): - config_node_ratis_reconfiguration_max_retry_attempts - schema_region_ratis_reconfiguration_max_retry_attempts - data_region_ratis_reconfiguration_max_retry_attempts Default 600 attempts at the fixed 2s interval (~20min cap). Also rename the now-misnamed EndlessRetryFactory / RatisEndlessRetryPolicy to Reconfiguration*, since the policy is no longer endless.

codecov · 2026-06-10T02:53:32Z

Codecov Report

❌ Patch coverage is 38.02817% with 44 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.74%. Comparing base (07b9cb0) to head (9f02f64).
⚠️ Report is 2 commits behind head on master.

Files with missing lines	Patch %	Lines
...he/iotdb/confignode/conf/ConfigNodeDescriptor.java	0.00%	12 Missing ⚠️
...apache/iotdb/confignode/conf/ConfigNodeConfig.java	25.00%	9 Missing ⚠️
...ol/thrift/impl/DataNodeInternalRPCServiceImpl.java	0.00%	7 Missing ⚠️
...java/org/apache/iotdb/db/conf/IoTDBDescriptor.java	0.00%	6 Missing ⚠️
...che/iotdb/confignode/manager/node/NodeManager.java	0.00%	4 Missing ⚠️
...ain/java/org/apache/iotdb/db/conf/IoTDBConfig.java	50.00%	4 Missing ⚠️
...confignode/manager/consensus/ConsensusManager.java	0.00%	2 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #17895      +/-   ##
============================================
+ Coverage     40.68%   40.74%   +0.06%     
- Complexity     2620     2621       +1     
============================================
  Files          5244     5244              
  Lines        362374   362696     +322     
  Branches      46653    46687      +34     
============================================
+ Hits         147419   147793     +374     
+ Misses       214955   214903      -52

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

jt2594838 · 2026-06-10T06:15:46Z

+  /**
+   * RatisConsensus protocol, max retry attempts for a configuration change (add/remove peer). Uses
+   * a fixed 2s retry interval; bounding the attempts stops a killed ADDING peer from blocking the
+   * reconfiguration -- and hence a region migration -- forever.
+   */
+  private int configNodeRatisReconfigurationMaxRetryAttempts = 600;
+
+  private int dataRegionRatisReconfigurationMaxRetryAttempts = 600;
+  private int schemaRegionRatisReconfigurationMaxRetryAttempts = 600;
+


Will 20 minutes be too long? May use a similar interval, referring to how CN determines whether a node is unknown.

Agreed. 20 minutes is too long for this failure path. I changed the default reconfiguration retry attempts from 600 to 15, which is about 30 seconds with the fixed 2s retry interval. This keeps it close to the ConfigNode Unknown detection window while leaving a small buffer for transient delays. The newly added Ratis region migration ITs have also been moved into DailyIT after the CI passed.

sonarqubecloud · 2026-06-10T07:19:28Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

CRZbulabula mentioned this pull request Jun 10, 2026

Bound Ratis reconfiguration retries to avoid stuck region migration #17894

Closed

CRZbulabula added 2 commits June 10, 2026 12:52

Add Ratis region migration rollback IT

fbde77c

Add more Ratis region migration ITs

56ba04a

CRZbulabula changed the title ~~Bound Ratis reconfiguration retries to avoid stuck region migration~~ Bound Ratis reconfiguration retries and add region migration ITs Jun 10, 2026

jt2594838 approved these changes Jun 10, 2026

View reviewed changes

Tune Ratis retry default and enable daily ITs

9f02f64

CRZbulabula merged commit 3030597 into master Jun 10, 2026
44 of 46 checks passed

CRZbulabula deleted the enhance-ratis-reconfiguration-retry-limit branch June 10, 2026 08:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bound Ratis reconfiguration retries and add region migration ITs#17895

Bound Ratis reconfiguration retries and add region migration ITs#17895
CRZbulabula merged 4 commits into
masterfrom
enhance-ratis-reconfiguration-retry-limit

CRZbulabula commented Jun 10, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 10, 2026 •

edited

Loading

Uh oh!

jt2594838 Jun 10, 2026 •

edited

Loading

Uh oh!

CRZbulabula Jun 10, 2026

Uh oh!

sonarqubecloud Bot commented Jun 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

CRZbulabula commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Changes

Ratis IT Coverage

Verification

Uh oh!

codecov Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jt2594838 Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

CRZbulabula Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

sonarqubecloud Bot commented Jun 10, 2026

Quality Gate passed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

CRZbulabula commented Jun 10, 2026 •

edited

Loading

codecov Bot commented Jun 10, 2026 •

edited

Loading

jt2594838 Jun 10, 2026 •

edited

Loading